Improving the Load Balance of MapReduce Operations based on the Key Distribution of Pairs

Authors

  • Liya Fan
  • Bo Gao
  • Xi Sun
  • Fa Zhang
  • Zhiyong Liu
Abstract

Load balance is important for MapReduce to reduce job duration and increase parallel efficiency. Previous work focuses on coarse-grained scheduling. This study concerns fine-grained scheduling of MapReduce operations, where each operation represents one invocation of the Map or Reduce function. Scheduling MapReduce operations is difficult because operation loads are highly skewed, there is no support for collecting workload statistics, and the scheduling problem has high complexity. Current implementations therefore adopt simple strategies, leading to poor load balance. To address these difficulties, we design an algorithm that schedules operations based on the key distribution of intermediate pairs. The algorithm involves a sub-problem of selecting operations for task slots, which we name the Balanced Subset Sum (BSS) problem. We discuss properties of BSS and design exact and approximation algorithms for it. To transparently incorporate these algorithms into MapReduce, we design a communication mechanism to collect statistics, and a pipeline within Reduce tasks to increase resource utilization. To the best of our knowledge, this is the first work on scheduling MapReduce workload at this fine-grained level. Experiments on PUMA [T+12] benchmarks show consistent performance improvement. The job duration can be reduced by up to 37%, compared with standard MapReduce.

Keywords: parallel computing; Cloud computing; MapReduce; load balance
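The abstract does not define BSS formally, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that the per-key load (for example, the number of intermediate pairs per key) is already known, and that selecting operations for a slot means picking a subset of keys whose total load is as close as possible to the average remaining load per slot. The names `closest_subset` and `schedule` are hypothetical, and the selection step is a textbook subset-sum dynamic program.

```python
from typing import Dict, List


def closest_subset(loads: Dict[str, int], target: int) -> List[str]:
    """Pick keys whose total load is as close as possible to `target`.

    Textbook subset-sum dynamic programming over reachable sums. This is a
    stand-in for the selection step, not the paper's BSS algorithm.
    """
    # reachable[s] holds one witness list of keys whose loads sum to s.
    reachable: Dict[int, List[str]] = {0: []}
    for key, load in loads.items():
        # Iterate over a snapshot so each key is used at most once.
        for total, chosen in list(reachable.items()):
            new_total = total + load
            if new_total not in reachable:
                reachable[new_total] = chosen + [key]
    # Closest sum to the target; ties go to the smaller sum.
    best = min(reachable, key=lambda s: (abs(s - target), s))
    return reachable[best]


def schedule(loads: Dict[str, int], slots: int) -> List[List[str]]:
    """Fill slots one at a time, each aiming at the average remaining load."""
    remaining = dict(loads)
    plan: List[List[str]] = []
    for slot in range(slots):
        slots_left = slots - slot
        if slots_left == 1:
            # The last slot takes whatever is left.
            chosen = list(remaining)
        else:
            target = round(sum(remaining.values()) / slots_left)
            chosen = closest_subset(remaining, target)
        plan.append(chosen)
        for key in chosen:
            del remaining[key]
    return plan
```

Calling `schedule(loads, slots)` with a dictionary of hypothetical per-key loads returns one list of keys per slot. The dynamic program is exact but its table grows with the total load, which is one reason an approximation algorithm, as the abstract mentions, may be preferable for large inputs.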


Similar resources

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale, data-intensive applications. The rack-aware data placement strategy of the current Hadoop distributed file system assumes a homogeneous cluster, in which each node has the same computing capacity and is assigned the same workload. Default Hadoop d...


OS4M: Achieving Global Load Balance of MapReduce Workload by Scheduling at the Operation Level

The efficiency of MapReduce is closely related to its load balance. Existing works on MapReduce load balance focus on coarse-grained scheduling. This study concerns fine-grained scheduling of MapReduce operations, with each operation representing one invocation of the Map or Reduce function. By default, MapReduce adopts the hash-based method to schedule Reduce operations, which often leads to po...


Online Distribution and Load Balancing Optimization Using the Robin Hood and Johnson Hybrid Algorithm

Proper planning of assembly lines is one of production managers’ concerns at the tactical level, since it makes it possible to use machine capacity, reduce operating costs, and deliver customer orders on time. The lack of an efficient method for balancing assembly lines can create serious problems for manufacturing organizations. The use of assembly line balancing methods cannot balan...


Multi Objective Optimization Placement of DG Problem for Different Load Levels on Distribution Systems with Purpose Reduction Loss, Cost and Improving Voltage Profile Based on DAPSO Algorithm

Along with the economic growth of countries, which leads to increased energy requirements, the problems of power quality and network reliability have received more attention, and in recent decades we have witnessed a noticeable growing trend of distributed generation (DG) sources in distribution networks. The presence of DG in distribution systems, in addition to changing the utilization of these sys...


Smart load shedding in distribution networks considering the importance of loads

One of the most important tasks of operators in distribution companies is load restoration after a fault in the network. In load restoration schemes, in addition to observing the load flow constraints, it is necessary to maintain the radial structure of the grid and, most importantly, to balance consumption against the load that can be supplied given the limitations of the backup feede...



Journal title:
  • CoRR

Volume: abs/1401.0355

Publication date: 2014